www.rspsciencehub.com Volume 03 Issue 03S March 2021 


INTERNATIONAL RESEARCH JOURNAL ON Se 
ADVANCED SCIENCE HUB 


RSP SCIENCE HUB 


(The Hub of Research Ideas) 
Available online at www.rspsciencehub.com 





BHA WS’ UNO, 


Special Issue of Second International Conference on Advancements in Research and Development 
(ICARD 2021) 


Evaluation of Bufferless Network-On-Chip with Parallel Port Allocator 
Cecil C Nachiar’, Dr.S.Sumathi * 

‘PG student, Dept. of ECE, Adhiyamaan College of Engineering, Hosur, Tamilnadu, India 

“Head of the Department, Dept. of ECE , Adhiyamaan College of Engineering, Hosur, Tamilnadu, India. 
cecilchinnaraj @ gmail.com’ 

Abstract 


Multicore network-on-chip when scales upto hundred of nodes, energy consumption, design complexity 
and cost increases multifold owing to structure of interconnect. Many researches are being conducted to 
design novel architecture to build efficient networks-on-chips. Our paper proposes efficient bufferless 
design with deflection containment technique to eliminate buffers and latency. The high cost of buffers 
motivate us to go for bufferless design, however with increasing network loads, it become notorious with 
multiple deflection and flit loss between nodes. To overcome this, we have designed a_ bufferless 
architecture with local bypass ring within nodes to reduce deflection and packet loss. Deflection 
Containment with the use of local bypass ring shortens critical path and improves performance. 
Architecture of our designed bufferless NoC is analysed and RTL implementation of its components is 
done with Xillinx ISE design suite and its working is analysed in Modelsim SE. Our evaluation proves that 
bufferless routing with deflection containment technique reduces power dissipation without compromising 
on its performance. 

Keywords: Multicore NoC ,Bufferless design, Deflection containment ,critical path 

1. Introduction 

Interconnection fabric becomes important design Process Element (R) Router (—— Link 
parameter in NoC on connecting on _ chip 
components. This ranks higher than traditional 
bus interconnection in scalability and bandwidth 
[1]. In design of NoCs, buffers consume more 
power and occupy larger area leading to high cost. 
Figure | presents the architecture of NoC with 
multicore system. As shown it connect all nodes 
within a chip with network interface and routers. 
Recent works on designing low cost NoCs_ by 
eliminating buffers has been discussed [2-10]. In 
case of buffered network whenever there is a 
congestion or destination port of that is busy, the 
packet remains idle in buffer and waits for 
acknowledgement to transmit. In this process we 
see that link bandwidth is not used unnecessarily. 
However in bufferless NoCs only pipeline 





Fig.1.Typical NoC Architecture 
registers are present which can hold only single flit 
of information. All the incoming flits are 
transmitted through pipeline’ registers and 
whenever destination port of particular flit is 
busy, it is routed to nearby port which is free. 
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When nearer ports are busy ,then it sends packet 
to whichever port is free irrespective of its distance 
to the destined port. So, whenever core increases 
packet injection rate also increases making 
deflection in bufferless router cumbersome 
reducing its performance [3]. Figure 2 shows the 
architecture of buffer and bufferless NoC. 
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Buffered router Burterless router 
Fig.2.Architecture of Buffered and Bufferless 
NoC 


However, various subsequent evaluation has 
brought out problems with adoption of bufferless 
routers [4,5]. Flit ranking and port allocation used 
so far in all designs has increased critical path 
delay as flits are allocated ports sequentially. Flint 
ranking method to avoid livelock during 
transmission such as oldest first etc shown in 
Table 1 has increased complexity in bufferless 
router design. As shown in Figure 3, Whenever 
network load increases , deflection rate increases 
causing flit contention reducing its performance. 
This leads to unnecessary multiple hops to a 
packet to reach its destination making deflection 
containment a stumbling block for building 
efficient routers as shown in Figure 3. 
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Fig.3. Performance of bufferless NoC with 
increasing loads. 
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ABNoC(Approximate Bufferless) [6], is a recent 
work in bufferless NoCs where network conflicts 
has been reduced and packet transmissions are 
done using approximate allocation mechanism. 
Though it improved latency and _ bandwidth 
utilisation as shown in Figure 4, with 
retransmission mechanism it makes the design 
more complex compared to other design. 


Table.1. Various Flit ranking rules 


Flit Ranking Rules 


Oldest First Older flits go before 
younger flits 


Flits closer to their 
destination before flits 
whose remaining 
distance is larger 

Flits that have been 


Closest First 


Most Deflections First 


deflected more before, 
flits deflected fewer 
times. 


Round Robin Flits from different 
input ports are ranked in 
round robin fashion. 


Hybrid Policy In odd cycles use Older 


first and for even 
cycles Round robin 
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Fig.4.Average latency of Various Bufferless 
NoCs 


BLESS (Bufferless NoCs) [7] was the first 
proposed bufferless NoC to eliminate buffers in 
conventional NoCs. Here each node, consists of an 
injection buffer and a reassembly buffer and when 
more than one flit contends for the same output, 
only oldest flit is transmitted and remaining flits 
goes to the port whichever is free. However, this 
age based priority used in BLESS is more 
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deterministic on latency and expensive. CHIPPER 
[8], low complexity buffer has simplified router 
micro architecture by removing in router buffers 
and crossbars. It is based on Golden packet and 
retransmit once algorithms. The reassembly buffer 
size plays very important role in determining its 
efficiency which makes it less desirable. Ring 
based deflection Routers [9,10] is desirable for 
bufferless operation as once flits enters the ring it 
simply reaches its destined port however 
complicates traffic management. MinBD[11], 
DeBAR [12] and SLIDER [13] employs side 
buffers with two stage pipeline registers. The side 
buffers used in this design holds the flits that are to 
be transmitted to busy ports and get reinjected later 
to artbitrate to required output port.AFC[13] and 
FLEXI [14] uses a power gating technique to 
switch to bufferless mode along with conventional 
input buffering technique. 

2.1Microarchitecture Of Proposed Bufferless 
Router 

We have overcome the challenges in previous 
methods and made slight changes in_ their 
architecture which drastically improve _ the 
performance of the router. Figure 5 shows the 
proposed router architecture. It uses two stages 
pipeline where destination ports of incoming flits 
are calculated using conventional routing logic are 
calculated at the initial stage. Once destination 
ports of flits are known, they are passes to sorting 
network logic where flits are ranked based on 
priority. Sorting logic is designed in such a way 
that top channel is__ given highest priority. Once 
flits are sorted , it enters the second stage where 
port allocation is done in parallel depending upon 
the availability of destination ports. Unlike 
conventional bufferless NoCs , instead of 
deflection to other possible nodes we are sending 
contending flits to local bypass channel formed 
within subnetwork of that node. It contends for 
destined port after each clock cycle till it gets its 
turn for transmission. Injection and ejection are 
done till channels are occupied and winning flits 
moves to destined port over crossbar. 

2.2 Flits Ranking By Priority 

Flits ranking by priority becomes the first stage in 
arbitration process which is done by sorting 
network as shown in Figure 5. Flits form four ports 
namely West, East ,South and North are 
prioritised not including flits in bypass port for 
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reducing hardware complexity. Since four ports 
are considered we are considering 2x2 partial 
permutation network with bitonic sorting. It is a 
parallel sorting algorithm where compare and 
exchange operation is done. Sorting is done such a 
way that top channel is with highest priority 
values. As shown in Figure 6, inputs a,b,c and d 
are inputs from west, east, north and south ports on 
sorting yield the output w,x,y and z. Example of 
bitonic sorting 1s shown in Figure 7. 
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Fig.5. DeC Router micro architecture 
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Fig.7.Bitonic sorting algorithm 


2.3 Parallel Port Allocator 

In earlier design, flits are allocated sequentially 
one by one in an order increasing delay in total 
transmission as shown in Figure 8. To overcome 
this, we are using now parallel port allocator where 
ports are assigned in parallel manner. This can be 
proved with RTL _ synthesis in results and 
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discussion. It strictly follows two steps as shown 
below - 

Step 1: Check for status of the destined port of 
particular flit. If that destined port is free without 
contention, then particular flit is transmitted to the 
destined port. 

Step 2: When the destined port of particular flit is 
busy it follows Step 2. Depending upon all other 
flits and their destined port , available ports are 
calculated .With this we are building simple look 
up table to calculate the allocated port. Here we 
have designed it to have a predefined order 
Bypass, North, South ,East and West as shown in 
Figure 9.First available port is given to the flit 
present in the lower channel. 
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Fig.8.Port allocation (a) BLESS (b) Parallel 
port allocator 
Consider an example where flit fo is to be moved 
to south, flit f; is to be moved to north f,, flit f> 1s 
to be moved to east fo, flit f3 1s to be moved to 
north. As per the flowchart in Figure 9, it first 
checks for non-contending ports. Here find fg and 
f2 are non-contending and hence these flits are 
transmitted to south and east ports respectively. 
We also find flits f; anf f; are contending for north 
port. As per our design first contending flit fo 
moves to bypass and and _ f3 moves to north 

port. Thus all flits get transmitted to destined port. 
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Fig.9.Flowchart of parallel port allocator 
Consider an example where flit fp is to be moved 
to south , flit f; 1s to be moved to north f,, flit f> is 
to be moved to east f>, flit f3 1s to be moved to 
north. As per the flowchart in Figure 9, it first 
checks for non-contending ports. Here find fg and 
f2 are non-contending and hence these flits are 
transmitted to south and east ports respectively. 
We also find flits f, anf f; are contending for north 
port. As per our design first contending flit fo 
moves to bypass and and _  f3 moves to north 
port. Thus all flits get transmitted to destined port. 
2.4 Ejection and Injection 
So far we have discussed transmission and 
reception of data along four ports. As discussed 
local bypass ring calls for fifth port bypass port. 
To avoid hardware complexity local ejection and 
injection are done separately from crossbar 
operation. With every increasing cycle, data is 
ejected out and this continues till space in sub 
network 1.e ports exist and if several ports are are 
free during transmission it follows the given 
order. To make our design more contention free 
we have used ejection before injection. 

2.5 Deflection using local bypasss ring 

To reduce the critical delay we partition a network 
present in every node to multiple sub networks 
bridged by bypass ring as shown in Figure 10. 
Partition is done in such a way that datapath width 
remains unchanged. This increases network path 
diversity reducing critical path delay. As shown in 
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Figure 10, each node consists of M subnetworks 
which are of same datapath width connected 
through local bypass ring which is unidirectional 
-Here we have used 2 sub networks for simplicity 
M= 2. 


virtual router [[] physical sub-touter ---)> bypass channel 





Fig.10. Deflection using local bypass ring with 
two sub networks 

A contending flit which is to be deflected in 
conventional NoCs will be ejected to subnetwork 
through local bypass ring reducing multiple hops 
and improving latency.As shown in Figure 11, flits 
fo and f; are contending for east port(3) . Incase of 
BLESS fo is routed to east port while f; is routed 
to north port which again takes 2 hops to reach 
east port (3).So totally it takes 3 hops for complete 
data transmission fg and f;. In our design, fo will be 
routed to east port and f; will be routed to bypass 
port which again in next cycle sends f; to east port 
totalling of 2 cycles of data transmission. Thus 
improving path diversity and latency of our 
proposed architecture. 

Deflection is avoided 


by forwarding jf) to the bypass 
channel. 





(b) In case of DeC: 
1 hop + bypass = P+ 2 cycles 


(a) In case of BLESS: 
3 hops =3*P cycles 
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Fig.11.Reducing hops with local bypass ring 
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3. Results and Discussion 

The proposed design is verified in ModelSim SE 
6.3f .Here we have analysed flits ranking with 
Bitonic sorting , flits prioritisation with parallel 
port allocator. As discussed above the RTL 
implementation of 2x2 bitonic sorting is shown in 
Figure 12 while the output of 2x2 bitonic sorting 1s 
shown in Figure 13. Here we have taken 8 bits in 
which last 4 bits contains the priority bits. Two 
inputs “00011010” and “O1OL11111” are given as 
inputs to a and b. By bitonic sorting , priority bits 
“1010” and “1111” are compared and output x 1s 
given with higher value (01011111) and output y 
is given with lower value (OO011010). 


| Xibra - [98 ~ OA2020 elu code\butferiess-noc\l\Lae - banc 44 age) 4 = 7 ~ T a. v_ WE: i 
Pretest Sow Process Window Help 


0 File Edt View ject Sowce 
PHA UB PASM ALA ae x oo hia 2 &25 0G LRMIAA RiA RG PPRKS 





not Properters 
uo No object tt selected 
o| Sena! +1! Name ae 


rs 
Ore Ly Weve | gg Frdin Fes ill] View by Care 





He fat Www Ade Format Teck Wenn 


ioe 102 AES jul ts Ney N0 Wingy aH itt bt 





aM 

















Fig.13. Simulation of 2x2 bitonic sorting 


As seen in router micro architecture partial 
permutation network for flits priotization is built 
with four 2x2 bitonic nodes attached in cross bar 
format. Flits from east and west ports are given as 
input a and b respectively to uO node whose output 
is al and b1,Flits from north and south ports are 
given as input c and d respectively to ul node 
whose output is cl and dl.The priority bits are 
compared and sorted and given to u2 and u3 whose 
outputs are (w,x),(y,z).Thus received flits from 
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four ports are ranked and sorted. Its RTL 
implementation is shown in Figure 14. and its 
simulation is illustrated in Figure 15. From this 
Figure we can see that “0001011”, “OO001000”, 
“00001101” and “00000111” flits to east, west, 
north and south port respectively are ranked in 
descending order of output ports w (OOO01101), 
x(00001011), y(OOOO1000) and z(O0000111). 
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Fig.14.RTL implementation of Sorting network 
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Fig.15.Simulation of Sorting network in our 
design 
Each router has 4 input and 4 output ports (East, 
West, North, South) and one bypass port in sub 
network for deflection containment. Assuming 8 
bit data last three bits indicate next hop of the flit. 
Consider flit fp) is to be moved to south 
f0(10000100), f; is to be moved to north 
f,(01000011), f2 is to be moved to east 
£2(00100001), f3; is to be moved to north 
f3(00010011). Here , according to DeC design it 
first checks for non-contention 1.e, fp can be moved 
to south port, f2 can be moved to east port. Then 
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we see that f; and f3 are contending for moving to 
north port. Here f1 is moved to bypass port and f4 
is moved to north port.RTL implementation of 
parallel port allocator is shown in Figure 16 and its 
simulation 1 1S shown 1 in Figure Ly, 
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Fig.16.RTL schematic of Parallel port allocator 
in our design 
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Fig.17: Simulation of parallel port allocator in 
our design 


Table 2 presents comparison of various 
bufferless NoCs RTL implementation of router 
in 15nm process. 
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Fig.19.Latency of various design with our 
design of parallel port allocator 


Conclusion 

We make the case that bufferless NoCs can be 
effective way to simplify our design reducing 
latency and improving performance. With the use 
of simple bypass ring in every node divided into 
subnetworks, deflection can be reduced and 
unnecessary hops of flits is removed improving its 
performance. Moreover parallel port allocation 
along with bitonic sorting can reduce delay and 
improve network bandwidth by allocating ports in 
parallel instead of sequential allocation. With 
reduced time and improved bandwidth power 
consumption is reduced and with removal of 
buffers we can achieve scalability 1.e reducing 
area.We have evaluated our design in Modelsim 
and also found that our proposed design is far 
more better than previous works.We have run 
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simulation with 4x4 mesh topology which in 

future will be tested for more number of cores and 

various topologies. 
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